COM+ Environment


Amitabh Srivastava and Hoi Vo

 

Microsoft Research


1.  What this document is about

 

This document attempts to provide insights into the following areas:

 

·         Analysis of MSIL as an intermediate language

·         Analysis of the managed vs. unmanaged environment

 

The COM+ platform provides a universal environment for developing applications.  It provides rich services for managed code, security, and interoperability.  The system uses a simple stack-based abstract machine that is suitable for compiling a wide variety of languages.  Programs are compiled for the abstract machine into a common intermediate language, MSIL, which can be executed on the virtual machine through several just-in-time (JIT) compilers.

 

The different features of the COM+ environment come at a cost in size, speed, ease of programming, etc.  These costs may prevent it from being successful in certain environments.  The goal of this report is to enumerate the trade-offs and try to quantify them.

 

We wanted to use real, large applications for our analysis.  Unfortunately, the COM+ software, including the compiler, is in its early stages of development and cannot correctly compile large applications like Word.  By modifying the sources to bypass compiler shortcomings, we were able to compile some applications like Vulcan, but the applications did not execute correctly.  We spent a considerable amount of time looking for suitable programs.  Finally, we took programs from the SPEC 2000 benchmark suite that we could modify to compile and run successfully.  The fact that this data was generated using small programs should be kept in mind while examining the results.  Our recommendation is to redo this analysis when the system is more mature.  Although we need large programs to really measure and understand the working set, cache behavior, and JIT complexities of the managed environment, the data from these programs still highlights several key points and provides a baseline for future reference.

I don’t think I would say “early stages”; it is probably more appropriate to say something like “pre-beta”.

IJW - You have focused this report purely on what we call our “IJW” scenario – recompiling existing C and C++ programs with little or no source code changes.  This is not the “sweet spot” of our product.  It takes advantage of virtually none of our key features.  Before any final conclusions can be drawn about the applicability of our system, one needs to measure a program tailored to our system: using primarily COM+ objects and the GC, using our class libraries rather than the C Runtime library, etc.  In these scenarios, our technologies will really excel.  The IJW scenario is designed to help people migrate code into the managed world.  It lowers the barrier to start using managed code because you can start with the code you have and evolve it incrementally.  I would call this report an assessment of the COM+ C++ IJW scenario rather than an assessment of the COM+ technology as a whole.  I think to extrapolate is very misleading.

 

WinCE - I am not going to comment on WinCE CEF stuff anywhere in this document.  It is not something COM+ is supporting and is irrelevant to an analysis of COM+.  I would prefer if you remove it from your analysis, but if you feel it is an important comparison point, please clarify that it is not connected to COM+.

 

Econo-JIT - I also am not going to comment on the Econo-JIT numbers.  The econo-JIT has some very specific uses and performance of entire apps running under it is not important.

·         First, it serves as an easy way to port COM+ to new platforms.  It was designed such that a single dev could port it to a new platform within a few weeks.  Our standard JIT, on the other hand, is likely to take many man-months to man-years to port to a new platform (depending on how different the new platform is).

·         In principle, the Econo-JIT can also be used in environments of extremely constrained memory, because it can JIT code quickly and discard code, re-JITing it on demand.  However, extremely memory-constrained environments are not a focus of V1 of the desktop URT, so we have not really worked on this scenario.

·         The Econo-JIT is also valuable for compiling and running sections of a program where the amount of time spent executing a section of code does not warrant spending a lot of time JITing it.  An example of this is startup code that is going to execute linearly (with no loops) and will only be executed one time.  Given profile feedback, we could decide to use the standard JIT on routines that are heavily executed and the Econo-JIT on routines that use very little execution time.  This is an area for work in a future version of our product.

·         Lastly, the Econo-JIT is a MUCH simpler piece of technology and is therefore much less prone to bugs than the standard JIT.  This is useful in the development process for tracking down bugs (e.g., if a program crashes, run it in the Econo-JIT and see if it still does; if not, the standard JIT is suspect).  It is also useful for being able to make progress while waiting on a bug fix.

The Econo-JIT is not intended to be used in general benchmarks.  We are not investing significant effort into making it run fast.

 

Install-O-JIT - The Install-O-JIT (or pre-jit) technology is a very important part of our product.  Running a program is obviously going to be slower if you must first compile the program (from IL to native code) before you execute it.  We are working hard on the pre-jit and, in fact, it does work in many situations.  Unfortunately the IJW scenario is not one of them.  The pre-jit work to support the IJW scenario is underway and will be completed within the next few weeks.  Our current experience with it in other scenarios is that it cuts load time by as much as a factor of two (by eliminating JITing and class loading) but increases the working set significantly.  The reason for the working set increase is that we have not yet done a good job isolating the fixups it must do.  This will be an important area for tuning work between now and RTM. We will be publishing some benchmarks on the pre-jit within the next week.  Please work with MaheshP on this.

 

Build 1423 - The COM+ build you did your performance analysis on had several significant performance regressions/bugs in it.  We have re-run your numbers against a 1515 build.  I have included the results below.  It would be great if you could re-run your numbers against this build because it would be far more representative of where we really are.  Of course there will be future improvements.  We are just now approaching code complete and will be devoting most of our remaining product cycle to bug fixing and performance tuning.  However the build you used had a few specific and egregious regressions from the work we have already done and it is a shame to publish that as “current fact”.

 

Image Size - Exe size is certainly an area where some of our features have a cost.  Including the meta-data increases the size of an image and in return gives some portability, version resilience, late binding, etc.  It is a tax you pay even if you don’t need any of those features.  In principle that tax can be reduced by decreasing the amount of meta-data if the user does not desire all of those features (however, I’m not willing to guess the result, so I am happy to let measurements without such improvements be used to judge our file size).  We have not done a lot of work in that area, but we will be looking at it after code complete.  If you wanted to make a comparison based on closer feature parity, you might consider running a comparison against an unmanaged program that has a type library.  Type libraries don’t provide all of the features that meta-data does, but they do, at least, provide late binding.  The last time I checked, Excel’s type library was on the order of 600K.  In addition, as I describe below, there were a few bugs in the build you used that inordinately increased the image size (for example, the debug info was included in the meta-data).  Lastly, it is worth noting that VC has a very complicated type system, and this results in more meta-data than for an equivalent program written in another language (like C#).  For example, the C++ language allows methods to be overloaded on the types int and long, even though both of these are simply 32-bit signed values.  C# makes no such distinction, but the C++ compiler must include additional meta-data to distinguish them in case some C++ consumer cares about the difference.  You might consider trying some comparisons between unmanaged programs and equivalent C# programs.
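To make the overloading point concrete, consider the small example below (our illustration; the comment about how the distinction might be recorded is our reading of the point above, not a statement about the actual meta-data encoding):

    // int and long are both 32-bit signed integers on x86, yet C++ treats
    // them as distinct types, so both overloads below are legal and must
    // remain distinguishable.  C# has no separate 32-bit long type, so an
    // equivalent C# program carries no such extra meta-data.
    void put(int x);    // 32-bit signed
    void put(long x);   // also 32-bit signed on x86, but a distinct C++ type

    // A managed C++ compiler must therefore annotate one of these
    // signatures (e.g., with a custom modifier) so that a C++ consumer can
    // still tell the two methods apart, even though a C# consumer never could.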

See the spreadsheet file:\\bharry_1\public\spec2000\sizes.xls for our analysis on file size.  This includes updates based on the 1515 build and some further analysis.

 

Execution perf – Execution performance of COM+ IL programs is affected by two issues: 1) the time to compile the program during startup, and 2) reduced code quality due to using a JIT compiler.  #1 will be substantially addressed by the Install-O-JIT described above.  It’s worth talking a bit about point #2.  It is clear the JIT generates worse code than the UTC compiler.  After all, it compiles programs more than 10X faster and uses many times less memory, so it has significantly less opportunity to do optimizations.  That said, for mainline integer code it is generally pretty good.  There are many cases where it breaks down.  The JIT’s floating point code has been notoriously bad (although we have improved it a lot in the last month or two).  Crafty highlights the fact that the JIT’s code gen for 8-byte integers is also pretty bad.  We have identified several easy optimizations that should significantly improve the performance here.  In the long run the Install-O-JIT will provide some relief here as well, since it is less constrained by time or working set than the JIT is.  Unfortunately, for schedule reasons, the code generator in the Install-O-JIT is the JIT code generator.  In a future version we will replace it with a compiler that is tuned to doing more optimizations.
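As an illustration of the 8-byte integer issue (our example, not code from crafty): on a 32-bit target every 64-bit operation expands into several 32-bit instructions, and allocating registers well across the two halves is exactly the kind of work a time-constrained JIT tends to skip.

    // Bitboard-style code of the kind crafty relies on: 64-bit masks,
    // shifts, and bit counts.  Each __int64 operation here expands to
    // multiple x86 instructions, so weak 8-byte code gen compounds quickly
    // in a loop like this.
    typedef unsigned __int64 BITBOARD;

    int popcount(BITBOARD b)
    {
        int n = 0;
        while (b) {
            b &= b - 1;   // clear the lowest set bit
            n++;
        }
        return n;
    }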

 

Our analysis shows that across programs that focus primarily on code execution speed, COM+ programs run 0%-25% slower.  We will be working hard on this between now and RTM.  Our goal is to be between 0% and 20% slower, with most programs in the 0% to 10% range.  On the other hand, programs that use our class libraries and our managed memory frequently perform better.

 

Working set – The working set of COM+ programs is definitely higher than that of comparable unmanaged programs.  Certainly one huge effect is the JIT.  The class loader/late binding support also adds overhead.  We will be working hard to keep this cost below that of a comparable VB program, but it is going to be somewhat higher than comparable VC programs.

 

Analysis – We have done a great deal of analysis on these programs.  Look on file:\\bharry_1\public\spec2000 for some very detailed analysis of the results. In addition you will find copies of the source, make files and a copy of the build we used to reproduce the results.  I am sending this feedback before everything is in place (because I have to leave for WHIPS in a few minutes).  Maheshp will make sure all of the info is on the share shortly.

 

 

2.   Terminology

 

We will refer to the native x86 binaries generated by the VC7 compiler as X86PE (/Os for space and /Ot for time), managed binaries generated for the COM+ 2.0 runtime as ILPE, unmanaged binaries for the COM+ 2.0 runtime as UM-ILPE (/Os for space), and specially annotated IL PE binaries compiled with the CEF compiler from the WinCE embedded development tools as OPTILPE.  The integrated Universal Runtime build 1423.7 (Beta 1 Milestone) of 3/6/00 was used for the analysis.

 

Note that UM-ILPE binaries are compiled for the COM+ 2.0 environment using the ‘same’ source code as the native compilations with no modifications, except for including the appropriate headers and function prototype changes.

 

3.   Analysis of MSIL as an intermediate language

 

In the COM+ environment, programs are compiled into PE binaries.  The text section of these binaries consists of MSIL instructions.  The meta-data for the program is stored in the read-only data section (.rdata).  In the native x86 environment, programs are also compiled into PE binaries, but the text section consists of native x86 instructions and the binaries contain no meta-data.

 

To measure the impact of compiling programs in the COM+ environment using MSIL as the intermediate language, we compared static and dynamic characteristics of the native (X86PE) and the MSIL (ILPE, UM-ILPE) binaries for several programs.  Different classes of applications (web applications, server applications, embedded applications) have their own requirements for size and time.  For example, performance may be key for server applications, size and working set for embedded applications, ease of programming for web applications, and binary size for applications shipped over the net.

 

For static characteristics, we analyzed the size of the two key sections (text and rdata) that contribute to the total size of the binary.  Figure 1 shows the sizes of the sections.  Note that there is no data for programs that failed to compile for a certain flavor of PE binary (i.e., OPTILPE or ILPE).  The total size of the MSIL binaries is about 74%-150% larger than that of the native binaries, and the meta-data constitutes a large majority of the increase.  As the native binaries contain no meta-data, the size of the .rdata section increases by 3-5x.  Besides contributing to the static size of the binary, meta-data is used by the JIT compiler at runtime.  This has an impact on the working set and cache behavior that may degrade execution time.

Our analysis on the 1515 build shows that the size increase ranges from 50% smaller to 42% bigger, with an average of 24% bigger for the test cases.  Further, there is one particularly unfortunate bug in the compiler/linker, which we were not able to get fixed for the 1515 build, that causes COFF mangled names to be carried into the PE file.  We have analyzed how much space this bug accounts for, and the new numbers range from 51% smaller to 37% bigger, with an average of 16% bigger for the test cases.  There were a variety of bugs in the build that you used that led to the numbers you saw.  The most significant was that debugging symbolic info was included in every image (even those built for retail), which accounts for a lot of the size; this bug has been fixed in the 1515 build.  It is also worth noting that when the pre-jit is used, much of the meta-data is not touched during execution unless the program is doing late binding.

 

The text section of the MSIL binaries is larger than that of the native binaries by 0-20%.  This could be due to MSIL not favoring compression as a design point, or due to inefficiencies in the code generated by the current compiler.  For comparison, p-code binaries, which built compression into the instruction encoding, were 60% smaller than x86 native binaries.  Simple compression schemes with encoding can be used to decrease the size of the text section.  Appendix 1 outlines one such simple scheme that results in a 33% compression ratio.  External compression and decompression techniques on the entire binary can also be utilized, but they come at a cost in execution time and working set.

Our updated numbers show that the text section is smaller for LZDump and Vortex and larger for bzip2 and crafty, with the average being 4.6% bigger.  I believe the current compiler generates reasonable IL, and I am not aware of any significant bugs that will change the results here.  I am, however, a little cautious of assuming everything in .text is IL.  Have you guys looked to make sure there isn’t other stuff in there?  We have not had a chance to investigate that.  The point about compression is a very good one.  We did work on compression about a year or so ago and the results were very impressive.  Unfortunately, we cut compression out of our V1 product because we feel we can add it in V2, and it just did not quite make the feature bar dictated by our intense desire to ship this product as quickly as we can.

 


 

For dynamic characteristics, we compared two key features: working set and runtime.  For runtime, we compared the speed of the native binaries with the MSIL binaries under the regular and the Econo JIT[1].  The MSIL binaries run slower than the native binaries.  Figure 2 shows the execution times for these programs.  For the regular JIT, the slowdown ranges from 10%-87%[2] for the programs we looked at.  For the Econo-JIT, the slowdown ranges from 4-6X[3].  The Econo-JIT generates native code at a faster rate but with lower quality, resulting in poor execution time.  Note that all the MSIL programs are running in the unmanaged environment.

Our numbers on the 1515 build are significantly different; our results range from 1.55% to 42.53% slower.  See file:\\bharry_1\public\spec2000\speed.xls for details.  As mentioned in the intro section, execution performance is not a focus for the Econo-JIT because it is not particularly relevant for the ways we use it.  BTW – we discovered and fixed a bug in the JIT that caused crafty to produce incorrect results.  This fix did not make it into the 1515 build; we will produce a new build this week that has the fix in it.

The slowdown is to be expected for the following reasons:

 


 

·         The native compiler generates code from its IL, which is decorated with rich annotations like alias information.  The JIT compiler, on the other hand, generates code from MSIL, which has no annotations, and does so under time restrictions.  MSIL has an option for adding annotations to the intermediate language.  This is called OPTIL and has not yet been implemented in the current COM+ compilers.  JIT compilers can use the annotations to generate better code.  However, OPTIL annotations will further increase the size of the text section.

·         Annotations have little or nothing to do with the difference.  While annotations are an approach for getting better code quality without sacrificing JIT throughput, the fundamental issue here is the quality of the code and the time to JIT the code.

 

WinCE has taken a separate approach: they generate OPTIL (MSIL with annotations) and then use a translator to generate native binaries.  The annotations aid the translator in generating good native code.  As they don’t see any size advantage in MSIL, their approach is to compile everything to native with no managed code; they do not use any features of the managed environment.  Basically, WinCE uses MSIL as an architecture-neutral definition format, so they can ship one binary and then use translators to generate code for each CPU.  As the OPTILPE binary is loaded on a particular WinCE device, a translator is invoked to “compile” the entire binary into the appropriate native form.  Figure 1 shows the size of OPTIL binaries from WinCE.  The OPTIL annotations increased the size of the text section by 44-114%.  Although we used the same programs for compiling in the WinCE environment, the WinCE and COM+ compilers have departed from the common code base they started from.

This is subtle feedback, but we use the term “managed” to refer to both IL and managed native code (managed code compiled to CPU-native instructions).  People frequently equate IL one-to-one with managed code.

 

·         The cost of the JIT compiler, garbage collection, and the impact of cache and working set will degrade execution time.  However, the programs whose performance we measured for COM+ were unmanaged.  They used the same libraries as the native versions and do not incur costs for services like garbage collection.  Also, these programs are relatively small, and their effect on working set is not fully visible, as these benchmarks spend all their time in small portions of code.[4]

Why have you included garbage collection here?  None of the tests you performed caused a single garbage collection to happen; the overhead of the GC was exactly 0 in every test.  It would be great to have some tests that compare the GC to the NT heap, but we should only draw conclusions on things we have tested here.  Again, how do you know there is a cost for garbage collection?  In fact, our tests show it to be significantly faster than traditional heaps.

 

Figure 3 shows the working set for a few programs that we managed to compile into MSIL.  There is a steep increase at the beginning, when the JIT compiler is generating code, and then again at the end during shutdown.  As these are small benchmarks, the JIT compiler is only needed at the beginning.  If we had larger applications, or if they used COM+ facilities such as garbage collection, we would see such spikes during the course of execution.

How did you conclude this?

Most of our programs are unmanaged, so they do not invoke any managed services like garbage collection or managed exception handling.  However, we do have two programs that generate SEH (Structured Exception Handling) exceptions.  As expected, the working set for these programs increased significantly versus their native counterparts.  For the two programs we profiled, most of the page faults originated from code addresses associated with the COM+ Execution Environment DLL.  We should revisit this area as larger managed binaries become available.
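For reference, a working set profile like the one in Figure 3 can be gathered from outside the process.  The sketch below is not the harness we used; it is a minimal illustration using the Win32 psapi call GetProcessMemoryInfo, with an arbitrary 100 ms sampling interval (link with psapi.lib):

    // Launch a program and print its working set periodically until it exits.
    #include <windows.h>
    #include <psapi.h>
    #include <stdio.h>

    int main(int argc, char** argv)
    {
        if (argc < 2) {
            fprintf(stderr, "usage: wsmon <command line>\n");
            return 1;
        }
        STARTUPINFOA si = { sizeof(si) };
        PROCESS_INFORMATION pi;
        if (!CreateProcessA(NULL, argv[1], NULL, NULL, FALSE, 0,
                            NULL, NULL, &si, &pi)) {
            fprintf(stderr, "CreateProcess failed: %lu\n", GetLastError());
            return 1;
        }
        // Sample every 100 ms; each line shows the current and peak working
        // set plus the cumulative page fault count.
        while (WaitForSingleObject(pi.hProcess, 100) == WAIT_TIMEOUT) {
            PROCESS_MEMORY_COUNTERS pmc = { sizeof(pmc) };
            if (GetProcessMemoryInfo(pi.hProcess, &pmc, sizeof(pmc)))
                printf("ws=%lu KB peak=%lu KB faults=%lu\n",
                       (unsigned long)(pmc.WorkingSetSize / 1024),
                       (unsigned long)(pmc.PeakWorkingSetSize / 1024),
                       (unsigned long)pmc.PageFaultCount);
        }
        CloseHandle(pi.hThread);
        CloseHandle(pi.hProcess);
        return 0;
    }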

 

Different JIT compilers have their own trade-offs.  The Econo-JIT is faster than the regular JIT compiler but generates low-quality code; it may be suitable for code that is not performance critical.  The regular JIT takes more time but produces better code and may be more suitable for performance-critical code.  As we don’t want the user to decide which JIT compiler to use, the system should determine this automatically.  It is essential to have runtime capabilities for collecting, analyzing, and optimizing ILPE binaries.

As mentioned in the intro, the user won’t be selecting the one to use.

 

For systems like IA-64, where scenario information is crucial for guiding optimizations like scheduling, we will see even more degradation in the native instructions generated by the JIT compiler if non-scenario-based code generation techniques are used.

 

4.   Managed vs. Unmanaged Environment

 

The key benefit provided by the managed environment is the opportunity to simplify the programming model.  For example, by having automatic garbage collection, the programming model frees the user from worrying about releasing memory.  This helps reduce programming errors by eliminating memory leaks and dangling pointers.  The URT has also used this feature to simplify the interface with COM objects by removing all the code needed for the initial handshake.

We don’t consider this to be the only key benefit.  For example, we consider a secure computing environment to be another key benefit.

 

As we discussed in the previous section, services in the managed environment come at a cost in performance.  For the environment to be successful, we need the right programming model to drive it.  We will now discuss the programming model.

You seem to be implying a fundamental cost that I don’t understand.  Managed images are a bit bigger because they require meta-data, and for now codegen is worse because we only have a JIT compiler.

 

Programming Model

 

Programmers move to a different programming paradigm when the pain is low and the gains are high.  Concepts like object-oriented programming and garbage collection have been known for decades and are well understood in languages like Smalltalk and Lisp.  However, C++ was the first language in which these concepts achieved major adoption.  C++ became successful because it used C as its base language, maintained the size and performance characteristics of C, and provided a programming model for writing object-oriented programs.  Java similarly used C++ as its base language and further simplified the programming model by providing garbage collection and convenient ways to interact with the Web.  However, Java failed to maintain the performance characteristics of C++, and performance has been one of the hurdles to Java’s wide adoption.

 

We have to consider similar factors while evaluating the programming models that we will present to our users.  We currently have several programming models.  One programming model allows the user to mix managed and unmanaged code.  This model allows the user to add annotations to source code by tagging data structures as managed or unmanaged.  It adds complexity by requiring the user to keep track of what is managed and what is not.  A simple error of failing to tag a data structure as managed would result in a program that compiles correctly but leaks memory (since the user assumes the system will garbage collect these objects).  The figure below shows how this environment appears to the user.  As we have lots of unmanaged code inside Microsoft, we should carefully examine this approach if we decide to move to the managed environment.

I agree there is some additional complexity for C++ programmers trying to straddle the line.
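To make the tagging pitfall concrete, here is an illustrative sketch.  The __gc tagging style is borrowed from the managed extensions for C++; the exact syntax in the build studied here may differ, and the class names are invented for the example.

    // A type the programmer tagged as managed, and one they forgot to tag.
    __gc class Session {            // managed: allocated on the GC heap
    public:
        int id;
    };

    class Connection {              // untagged, so it stays unmanaged
    public:
        char buffer[4096];
    };

    void handle()
    {
        Session*    s = new Session;     // reclaimed by the garbage collector
        Connection* c = new Connection;  // the author believes this is managed
                                         // too; the program compiles cleanly,
                                         // but without a matching delete the
                                         // object simply leaks
    }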

 

[Figure: the mixed managed/unmanaged environment as it appears to the user]

We also have pure programming models provided by languages like COOL, which use C++ as the base language and provide a pure managed environment to operate in.  As they will be in the same performance ballpark as existing managed environments, we should carefully consider what extra benefits COOL provides over similar existing languages.

You seem to conclude here that performance of our environment will be bad.  Are you alluding to something other than the temporary problem of lower quality codegen?

 

New programming model

 

If we want to attract new users to our platform, we should carefully consider any new programming model and make sure it addresses critical needs.  For example, we should consider the new opportunities provided by Internet devices in the embedded market.  The current devices consist of multiple chips (DSPs, StrongARMs, specialized processors, etc.), and all these processors have to run concurrently.  As performance is critical, the majority of the programming is currently done in assembly language, and concurrency is handled manually.  A common VM for this environment is very attractive, and in combination with a programming model that can handle concurrency, we can provide critical value.  However, we have to make sure that we also address the performance issue.

 

5.   Summary

 

The COM+ environment presents an opportunity to provide a simple and powerful programming model for a wide variety of environments.  In the current design, this comes at a cost in performance, although its performance will be similar to other existing managed environments.  For environments like Web applications (ASP, Web forms, etc.), where the programming environment and interoperability are important, COM+ should be able to address the need.  Major enhancements will be needed for performance-critical environments.

It seems to me that simpler and more robust programs are valuable in most environments.  Obviously there are some where performance and access to the “metal” are more important; I would not advocate writing device drivers as managed code, as there is no value there.  However, to say that it is only applicable to ASP seems very narrow.  There is no question that in V1 there are going to be several shortcomings in perf: codegen will be slower, startup time will be longer, and working set will be higher.  Many of these issues are transient and are not fundamental to the model.  NT performance, for example, has improved significantly with every release since the first; in fact, the first release of NT had some serious perf problems.  The real question is: is the perf good enough for key scenarios, and can the issues be fixed in subsequent releases?

 

Defining a good programming model is crucial and is an important issue that should be considered carefully.  We should avoid the pitfall of trying to address everything.  New programming models, like handling concurrency in the embedded environment, should be considered.

 

Acknowledgements

 

The COM+ runtime team and the WinCE team were extremely helpful during this study and patiently answered numerous questions.  They helped us work around problems and provided the latest copies of their software.  Chris Fraser developed the simple MS-IL compression scheme.  Ben Zorn, Chris Fraser, and Dave Hanson participated in a number of discussions.  Our thanks to all.

 

 


Appendix 1:  Compressing MS-IL code

 

MS-IL code can be much smaller.  All it needs is better use of a method that it already employs, namely escape codes for rare instructions and abbreviations for the common ones.  For example, the simple procedure below accepts a code base and infers a shorter version of MS-IL from it (a code sketch follows the steps):

1.       Find the one-byte instruction opcode S that is used least in the code base.

2.       Find the two-byte pair P that is used the most.  In order to allow direct interpretation, skip pairs that start in the middle of an instruction.

3.       Replace each instance of the rare S with a two-byte escape sequence.

4.       Replace each instance of the common pair P with the one-byte opcode formerly used for S.

5.       Repeat until diminishing returns.
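The following is a minimal sketch of this procedure, assuming the code base is available as a flat byte stream.  It makes two simplifications a real tool would not: it ignores instruction boundaries when counting pairs (step 2), and it assumes the chosen escape byte value does not occur in the input.

    // Iteratively retire the rarest byte value S behind a two-byte escape
    // and give its one-byte code to the most frequent adjacent pair P.
    #include <stdio.h>
    #include <vector>

    static long singles[256];
    static long pairs[256][256];

    // One pass of steps 1-4.  Returns the bytes actually saved; 0 means stop.
    static long PairPass(std::vector<unsigned char>& code, int esc)
    {
        int a, b;
        size_t i;
        for (a = 0; a < 256; a++) {
            singles[a] = 0;
            for (b = 0; b < 256; b++) pairs[a][b] = 0;
        }
        for (i = 0; i < code.size(); i++) {
            singles[code[i]]++;
            if (i + 1 < code.size()) pairs[code[i]][code[i + 1]]++;
        }
        // Step 1: the rarest single-byte code S (never the escape itself).
        int s = (esc == 0) ? 1 : 0;
        for (a = 0; a < 256; a++)
            if (a != esc && singles[a] < singles[s]) s = a;
        // Step 2: the most common pair P, skipping pairs that touch S or
        // the escape so the rewrite below stays unambiguous.
        int p1 = -1, p2 = -1;
        long best = 0;
        for (a = 0; a < 256; a++)
            for (b = 0; b < 256; b++)
                if (a != s && b != s && a != esc && b != esc &&
                    pairs[a][b] > best) {
                    best = pairs[a][b]; p1 = a; p2 = b;
                }
        // Each replaced pair saves one byte; each escaped S costs one.
        if (p1 < 0 || best - singles[s] <= 0) return 0;
        // Steps 3 and 4: rewrite the stream.
        std::vector<unsigned char> out;
        out.reserve(code.size());
        for (i = 0; i < code.size(); ) {
            if (i + 1 < code.size() && code[i] == p1 && code[i + 1] == p2) {
                out.push_back((unsigned char)s);    // P takes over S's code
                i += 2;
            } else if (code[i] == s) {
                out.push_back((unsigned char)esc);  // a real S becomes esc,S
                out.push_back((unsigned char)s);
                i += 1;
            } else {
                out.push_back(code[i++]);
            }
        }
        long saved = (long)code.size() - (long)out.size();
        if (saved <= 0) return 0;   // overlapping pairs can shrink the win
        code.swap(out);
        return saved;
    }

    int main()
    {
        std::vector<unsigned char> code;
        int c;
        while ((c = getchar()) != EOF) code.push_back((unsigned char)c);
        size_t before = code.size();
        while (PairPass(code, 0xFE) > 0)  // step 5: repeat until no gain
            ;                             // 0xFE assumed unused in the input
        printf("%lu -> %lu bytes\n",
               (unsigned long)before, (unsigned long)code.size());
        return 0;
    }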

In a trial on a 13,000-line C program, even this primitive method saved 33%, by escaping 60 rare instructions and using their codes for common idioms. The first iteration, for example, escapes the floating-point conversion conv.r.un, which is used only three times, and gives its opcode to the first two bytes of the load-immediate ldc.i4+00, which appears 6,350 times. The table below adds some related measurements for this trial:

 

 

Bytes       Form

174,677     MS-IL instructions from the 12/99 “tech preview” compiler with full space optimization option (/Oxs).

140,655     After incorporating existing MS-IL abbreviations not emitted by the compiler above.

93,757      After the "pairing" procedure above.  33% smaller than the 140KB above.

54,089      With gzip instead of pairing.  This format is not directly interpretable, but it indicates what might be gained by more sophisticated compression.

147,456     x86 native-code text segment from MSVC with full space optimization.  Included for comparison only.

 

 

Code size might be a secondary consideration for today’s crucial scenarios (x86 desktop systems with plenty of memory and network bandwidth), but MS-IL will eventually become our natural representation for code on other computing platforms, including handheld computers and smart cards, where memory, bandwidth, and their associated power drains make compression important.

In the long run, MS-IL should replace the x86 executable format, and we do not want to repeat Intel’s mistakes.  We want to make certain that Version 1 uses the best representation possible and that it extends gracefully (for example, to better compressors, which will surely emerge).  If schedules preclude delivering code compression in Version 1 of the URT, then the executable format should at a bare minimum specify a compression format (initially the “null” compressor), so that code compression can be added when other platforms need it.


Figure 1: Static Analysis

 


Figure 1 (continued)

 




Figure 2: Performance measurements

 


 


Figure 3: Working Set

 

 




[1] The Install JIT is currently being implemented, so we could not measure it.  The Install JIT compiles the entire binary at load time.  We did, however, use a utility called PreJIT.EXE that creates a .ZAP file from a pure ILPE binary.  This binary currently does not run; we use it only for static comparison analysis.

[2] For the exception handling programs (Exception SEH and Exception Typed) we noticed a 7-9x slowdown in managed programs.  Exception handling in the managed environment is expensive, as it does substantial bookkeeping.

[3] For crafty, the program was still running after more than 5 minutes.  As it did not end, we left the entry blank.  The ILPE version of Vortex crashes under both the regular and the Econo-JIT.

[4] Transmeta uses a similar approach of translating x86 instructions on the fly and caching the results.  Their speed is quite impressive.  They use a large buffer to cache the results.  On small programs, results are very predictable.  However, on large programs, when the working set is large, performance degrades quite a bit.  Increasing the buffer has implications for the lookup time.